{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ur8xi4C7S06n"
      },
      "outputs": [],
      "source": [
        "# Copyright 2025 Google LLC\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JAPoU8Sm5E6e"
      },
      "source": [
        "# Video Captioning with Gemini\n",
        "\n",
        "<table align=\"left\">\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/multimodal-data-curation/captioning.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fmultimodal-data-curation%2Fcaptioning.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/multimodal-data-curation/captioning.ipynb\">\n",
        "      <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/multimodal-data-curation/captioning.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://raw.githubusercontent.com/primer/octicons/refs/heads/main/icons/mark-github-24.svg\" alt=\"GitHub logo\"><br> View on GitHub\n",
        "    </a>\n",
        "  </td>\n",
        "</table>\n",
        "\n",
        "<div style=\"clear: both;\"></div>\n",
        "\n",
        "<b>Share to:</b>\n",
        "\n",
        "<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/multimodal-data-curation/captioning.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/multimodal-data-curation/captioning.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/multimodal-data-curation/captioning.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/multimodal-data-curation/captioning.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/multimodal-data-curation/captioning.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n",
        "</a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "84f0f73a0f76"
      },
      "source": [
        "| Author(s) |\n",
        "| --- |\n",
        "| [Noa Ben-Efraim](https://github.com/noabenefraim) |"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tvgnzT1CKxrO"
      },
      "source": [
        "## Overview\n",
        "\n",
        "This  notebook details the application of the Gemini model for video captioning within our Video Language Model (VLM) training data curation pipeline. \n",
        "\n",
        "This process is a critical step because it establishes the necessary textual correspondence for effective multi-modal learning. By generating rich captions that describe the visual and auditory content of the video, we enable the VLM to learn joint embeddings and understand the complex interplay between different modalities. High-quality captions serve as a structured representation of the video's semantic content, directly contributing to the performance of the trained VLM on various downstream tasks. \n",
        "\n",
        "This notebook will cover:\n",
        "+ How to set up a meaningful system prompt to cater your dataset \n",
        "+ Detailed example that highlights Gemini's multi-modal capabilities that capture audio, visual, and text inputs from the video."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f4cbc20d4056"
      },
      "source": [
        "## Important Note\n",
        "\n",
        "This notebook explores methods for video captioning. However, it is crucial to understand that using the output or capabilities of the Gemini API to train other models that compete with Gemini API or Google AI Studio is a violation of the [Gemini API Additional Terms of Service](https://ai.google.dev/gemini-api/terms).\n",
        "\n",
        "Specifically, the terms state under \"Use Restrictions\": \"You may not use the Services to develop models that compete with the Services (e.g., Gemini API or Google AI Studio).\" Attempting to use data generated by Gemini, such as video captions produced by the API, as a dataset for training a separate video captioning model could fall under this restriction.\n",
        "\n",
        "Users are responsible for ensuring their use of the Gemini API complies with all applicable terms and policies. Proceeding with actions that violate the terms of service may result in suspension or termination of access to the service."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "61RBz8LLbxCR"
      },
      "source": [
        "## Get started"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "No17Cw5hgx12"
      },
      "source": [
        "### Install Google Gen AI SDK and other required packages\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "tFy3H3aPgx12"
      },
      "outputs": [],
      "source": [
        "%pip install --upgrade --quiet google-genai"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dmWOrTJ3gx13"
      },
      "source": [
        "### Authenticate your notebook environment (Colab only)\n",
        "\n",
        "If you're running this notebook on Google Colab, run the cell below to authenticate your environment."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NyKGtVQjgx13"
      },
      "outputs": [],
      "source": [
        "import sys\n",
        "\n",
        "if \"google.colab\" in sys.modules:\n",
        "    from google.colab import auth\n",
        "\n",
        "    auth.authenticate_user()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DF4l8DTdWgPY"
      },
      "source": [
        "### Set Google Cloud project information\n",
        "\n",
        "To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).\n",
        "\n",
        "Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "id": "Nqwi-5ufWp_B"
      },
      "outputs": [],
      "source": [
        "# Use the environment variable if the user doesn't provide Project ID.\n",
        "import os\n",
        "\n",
        "PROJECT_ID = \"[your-project-id]\"  # @param {type: \"string\", placeholder: \"[your-project-id]\", isTemplate: true}\n",
        "if not PROJECT_ID or PROJECT_ID == \"[your-project-id]\":\n",
        "    PROJECT_ID = str(os.environ.get(\"GOOGLE_CLOUD_PROJECT\"))\n",
        "\n",
        "LOCATION = os.environ.get(\"GOOGLE_CLOUD_REGION\", \"us-central1\")\n",
        "BUCKET = \"[your-gcs-bucket]\"\n",
        "VIDEO_NAME = \"[your-video-name]\"\n",
        "\n",
        "from google import genai\n",
        "\n",
        "client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5303c05f7aa6"
      },
      "source": [
        "### Import libraries"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6fc324893334"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "from typing import Any\n",
        "\n",
        "from google import genai\n",
        "from google.genai import types"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "01601b5549d5"
      },
      "source": [
        "### Sample Video\n",
        "\n",
        "This notebook will use a sample video clip from the VidGen_1M dataset to illustrate the application of Gemini for comprehensive video captioning. Before proceeding, please review the video. You will observe a rich array of multi-modal information, including: textual elements present in tickets, on-screen graphics, and illustrations; visual elements depicting the weatherman, their attire, and the information conveyed through graphics and maps; and distinct audio elements comprising the weatherman's spoken content that is not explicitly represented visually or textually.\n",
        "\n",
        "Please refer to the sample video found [here](gs://github-repo/generative-ai/gemini/use-cases/multimodal-data-curation/_sNnSQ6cQYk-Scene-0006.mp4)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "a71e75815bdf"
      },
      "source": [
        "<video width=\"320\" height=\"240\" controls>\n",
        "  <source src=\"_sNnSQ6cQYk-Scene-0006.mp4\" type=\"video/mp4\">\n",
        "</video>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "917ae545862c"
      },
      "source": [
        "#### System Prompt\n",
        "\n",
        "Below we have outlined a system prompt to use. This system prompt is designed to guide Gemini in generating detailed video captions that focus on capturing dynamic elements like camera movement and subject motion. Key instructions emphasize describing the scene from a specific perspective, using vivid language to illustrate actions and atmosphere, and including details about the setting, subject appearance, body language, and any visible text. The overarching goal is to create rich, nuanced captions that provide a comprehensive understanding of the video's content and context."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "af61d03f1280"
      },
      "outputs": [],
      "source": [
        "system_instructions = \"\"\"Generate a detailed video caption that prioritizes capturing the scene's motion, perspective, and nuanced descriptions. Focus on using strong verbs and adverbs to convey the action and atmosphere. Only return the caption. Do not send a preamble. \n",
        "\n",
        "# Key Elements to Emphasize\n",
        "\n",
        "* Perspective and Camera Movement: Specify the viewer's position relative to the scene (e.g., \\\"from the perspective of a toddler,\\\" \\\"from a low angle,\\\" \\\"from a high angle,\\\" \\\"a wide shot\\\").\n",
        "* Describe any camera movement, including direction (e.g., \\\"strafing left,\\\" \\\"panning right,\\\" \\\"zooming in,\\\" \\\"tracking the subject\\\").\n",
        "* Describe the relative size of the subject in the frame.\n",
        "* Subject Motion and Actions: Use vivid verbs to illustrate the subject's actions (e.g., \\\"standing,\\\" \\\"sitting,\\\" \\\"walking,\\\" \\\"bowing,\\\" \\\"playing,\\\" \\\"illuminating\\\").\n",
        "* Employ adverbs to enhance the description of the actions (e.g., \\\"slowly strafing,\\\" \\\"intensely focusing,\\\" \\\"dimly lit,\\\" \\\"dramatically illuminating\\\").\n",
        "* Describe the direction of motion.\n",
        "* Describe the speed of motion.\n",
        "* Scene Composition and Atmosphere: Describe the setting in detail (e.g., \\\"grand piano on a stage,\\\" \\\"empty auditorium,\\\" \\\"dimly lit theater\\\").\n",
        "* Capture the emotional tone or atmosphere (e.g., \\\"dramatic,\\\" \\\"solitude,\\\" \\\"introspection\\\").\n",
        "* Describe the lighting.\n",
        "* Describe the relative positions of the objects in the scene.\n",
        "* Subject Details: Provide specific details about the subject's appearance (e.g., \\\"man in a gray suit\\\").\n",
        "* Describe the subjects body language (e.g. \\\"takes a deep bow that wraps his arms around his back and abdomen\\\").\n",
        "* Include references to text that is visible in the video.\"\"\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "e43229f3ad4f"
      },
      "source": [
        "#### Load model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "cf93d5f0ce00"
      },
      "outputs": [],
      "source": [
        "MODEL_ID = \"gemini-2.0-flash\"  # @param {type:\"string\"}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1294dfb411f9"
      },
      "source": [
        "#### Define helper functions"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5c0804536c54"
      },
      "outputs": [],
      "source": [
        "def get_captions_from_video(\n",
        "    video_uri: str,\n",
        "    model: str = MODEL_ID,\n",
        "    system_instructions: str = system_instructions,\n",
        "    temperature: float = 0.7,\n",
        "    count_video_tokens: bool = True,\n",
        ") -> dict[str, Any]:\n",
        "\n",
        "    contents = [\n",
        "        types.Part.from_text(text=\"Caption this video\"),\n",
        "        types.Part.from_uri(file_uri=video_uri, mime_type=\"video/mp4\"),\n",
        "    ]\n",
        "\n",
        "    video_token_count = None\n",
        "    if count_video_tokens:\n",
        "        video_token_count = client.models.count_tokens(\n",
        "            model=\"gemini-2.0-flash-001\",\n",
        "            contents=[types.Part.from_uri(file_uri=video_uri, mime_type=\"video/mp4\")],\n",
        "        ).total_tokens\n",
        "\n",
        "    response = client.models.generate_content(\n",
        "        model=model,\n",
        "        contents=contents,\n",
        "        config=types.GenerateContentConfig(\n",
        "            system_instruction=system_instructions,\n",
        "            temperature=temperature,\n",
        "        ),\n",
        "    )\n",
        "\n",
        "    return {\n",
        "        \"prompt_token_count\": response.usage_metadata.prompt_token_count,\n",
        "        \"candidates_token_count\": response.usage_metadata.candidates_token_count,\n",
        "        \"total_token_count\": response.usage_metadata.total_token_count,\n",
        "        \"model_version\": response.model_version,\n",
        "        \"text\": response.candidates[0].content.parts[0].text,\n",
        "        \"video_only_token_count\": video_token_count,\n",
        "    }"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "7008b586e58b"
      },
      "outputs": [],
      "source": [
        "# To test this yourself, run this cell\n",
        "response = get_captions_from_video(\n",
        "    \"gs://github-repo/generative-ai/gemini/use-cases/multimodal-data-curation/_sNnSQ6cQYk-Scene-0006.mp4\"\n",
        ")\n",
        "response"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "de8be65f8c7f"
      },
      "source": [
        "#### Sample output\n",
        "\n",
        "_From the perspective of a news viewer, <span style=\"color: red;\"> the camera remains fixed </span> on a weather report. <span style=\"color: red;\"> A bald, middle-aged Caucasian man in a dark suit and magenta tie stands confidently on the right side of the screen.</span> He <span style=\"color: red;\"> gestures </span> toward a <span style=\"color: red;\">graphic displaying the Omaha, Nebraska skyline </span>  at night, complete with <span style=\"color: red;\"> falling snow effects</span> . The graphic boldly states <span style=\"color: blue;\">\"TONIGHT 27° Wintry Mix\" </span>, with wind from the North-Northwest at 10-15 mph. The threat tracker is listed as <span style=\"color: blue;\">\"High.\"</span>, <span style=\"color: purple;\">The man explains that the wintry mix will end tonight, but that everything will be iced over as the temperature drops into the upper 20s.</span> A smaller graphic on the lower left shows a map of the region, with varying shades of purple indicating <span style=\"color: blue;\">\"Ice Storm Warning\"</span>, for <span style=\"color: red;\">several counties, including Washington, York, Madison, and Wayne</span>. <span style=\"color: blue;\">A banner at the bottom of the screen warns of snapped power lines and falling tree branches due to ice accumulation.</span> As the man continues his report, the graphic switches to <span style=\"color: blue;\">\"TOMORROW 36° Partly Sunny\"</span>, with the threat tracker listed as <span style=\"color: blue;\">\"Low.\"</span> <span style=\"color: purple;\">He concludes that sunshine will reemerge tomorrow and that temperatures will climb back to about 36 degrees.</span>_\n",
        "\n",
        "Captured from: <span style=\"color: red;\"> video</span>, <span style=\"color: blue;\">text</span>, <span style=\"color: purple;\">audio</span>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "a9db47478fd4"
      },
      "source": [
        "The detailed and multi-modal captions generated by Gemini are highly valuable for downstream tasks such as embedding. By capturing a rich array of visual, textual, and implied auditory information, these captions provide a strong semantic representation of the video content. \n",
        "\n",
        "Next in the data curation pipeline, we will embed these comprehensive captions allows for effective semantic deduplication within the data curation pipeline, enabling the identification and removal of near-duplicate video content based on their meaning and context, rather than just pixel-level similarity. This ensures a cleaner and more diverse dataset for VLM training."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "captioning.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
